Abstract

In this paper, we study the problem of emulating $T_G$ steps of an $N_G$-node
guest network on an $N_H$-node host network. We call an emulation
work-preserving if the time required by the host, $T_H$, is
$O(T_G N_G / N_H)$, because then both the guest and host networks perform the
same total work, $\Theta(T_G N_G)$, to within a constant factor. We say that
an emulation is real-time if $T_H = O(T_G)$, because then the host emulates
the guest with constant delay. Although many isolated emulation results have
been proved for specific networks in the past, and measures such as dilation
and congestion were known to be important, the field has lacked a model
within which general results and meaningful lower bounds can be proved. We
attempt to provide such a model, along with corresponding general techniques
and specific results, in this paper. Some of the more interesting and diverse
consequences of this work include:

1.  a proof that a linear array can emulate a (much larger) butterfly in a
work-preserving fashion, but that a butterfly cannot emulate an expander (of
any size) in a work-preserving fashion,

2.  a proof that a mesh can be emulated in real time in a work-preserving
fashion on a butterfly, even though any $O(1)$-to-1 embedding of a mesh in a
butterfly has dilation $\Omega(\log N)$,

3.  a proof that an $N \log N$-node butterfly can be emulated in a
work-preserving fashion on an $N$-node shuffle-exchange graph, and vice versa,

4.  simple $O(N^2/\log^2 N)$-area and $O(N^{3/2}/\log^{3/2} N)$-volume layouts
for the $N$-node shuffle-exchange graph, and

5.  an algorithm for sorting $N$ numbers in $O(\log N)$ steps with high
probability on an $N$-node shuffle-exchange graph with constant-size queues.


1  Introduction 

1.1  The  Problem 

In this paper, we study the problem of emulating an $N_G$-node guest network
$G = (V_G, E_G)$ on an $N_H$-node host network $H = (V_H, E_H)$, where
$N_H \le N_G$. Our goal is to emulate $T_G$ steps of any computation on $G$
in $T_H = S T_G$ steps on $H$, where $S$ (the slowdown of the emulation) is
as small as possible.

The slowdown of the emulation must always be at least as large as $N_G/N_H$,
since $G$ has $N_G/N_H$ times as many processors as $H$. If
$S = O(N_G/N_H)$, then we say that the emulation is work-preserving, because
then the total work (i.e., the processor-time product) performed by the
emulating network ($W_H = T_H N_H$) is within a constant factor of the work
performed by the guest network ($W_G = T_G N_G$). Such emulations achieve
optimal speedup (to within a constant factor) over sequential emulations of
$G$, since they use $N_H$ processors to solve a problem $\Theta(N_H)$ times
faster than is possible with a single processor.
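
The definitions above reduce to simple arithmetic. As a minimal sketch (the
function names and the sample sizes below are illustrative assumptions, not
taken from the paper), the following computes the slowdown and the ratio of
host work to guest work for a hypothetical emulation:

```python
def slowdown(t_host, t_guest):
    """S = T_H / T_G."""
    return t_host / t_guest


def work(steps, nodes):
    """Processor-time product W = T * N."""
    return steps * nodes


def work_ratio(t_host, n_host, t_guest, n_guest):
    """W_H / W_G; a ratio bounded by a constant means work-preserving."""
    return work(t_host, n_host) / work(t_guest, n_guest)


# Hypothetical sizes: a 1024-node guest emulated on a 64-node host.
n_guest, n_host = 1024, 64
t_guest = 100
# Assume the host runs at twice the minimum possible slowdown N_G / N_H.
t_host = 2 * t_guest * n_guest // n_host

print(slowdown(t_host, t_guest))                      # 32.0
print(work_ratio(t_host, n_host, t_guest, n_guest))   # 2.0
```

Here the slowdown is 32 but the work ratio is only 2, so an emulation
achieving these numbers for all sizes would be work-preserving even though it
is far from real-time.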

More generally, we say that there is a work-preserving emulation of a class
of networks $\mathcal{G}$ by a class of networks $\mathcal{H}$ with slowdown
$S(N)$ if for every $N$ and $T$, we can emulate any $T$ steps of any
$S(N)N$-node network in $\mathcal{G}$ in $O(S(N)T)$ steps on any $N$-node
network in $\mathcal{H}$. If $S(N) = O(\log^\alpha N)$ for some constant
$\alpha$, then we say that the emulation is NC work-preserving, since every
step of $G$ can be emulated in $O(\log^\alpha N)$ steps of $H$. If
$S(N) = O(N^\alpha)$ for some constant $\alpha$, then we say that the
emulation is polynomial time work-preserving, and so on. In the special case
that $S(N) = O(1)$, we say that the emulation is real-time. Real-time
emulations are the hardest to obtain, since we require the host network to
emulate a guest network of the same size with constant slowdown.

As a simple example, let $\mathcal{G}$ be the class of linear arrays, and
$\mathcal{H}$ be the class of all bounded-degree connected networks. It is
well known [18] that an $N$-node linear array can be embedded one-to-one in
any connected bounded-degree $N$-node network with constant dilation and
congestion. (By an embedding of a graph $G$ into a graph $H$, we mean a
mapping $\phi : G \to H$ that maps the nodes of $G$ to the nodes of $H$ and
the edges of $G$ to paths in $H$. The dilation of an embedding is the length
of the longest path $\phi(e)$ corresponding to an edge $e$ of $G$. The
congestion of an embedding is the largest number of paths $\phi(e)$ crossing
a single edge of $H$. The load of an embedding is the maximum number of nodes
of $G$ mapped to a single node of $H$. In a one-to-one embedding, the load
is 1.) Hence any $N$-node bounded-degree connected network $H$ can emulate
any $N$-node linear array with constant slowdown, and thus there is a
real-time emulation of the class $\mathcal{G}$ by the class $\mathcal{H}$.
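
To make the three measures concrete, the following sketch computes the
dilation, congestion, and load of an explicitly given embedding. The
representation (dictionaries for the node map and for the edge-to-path map)
and the small example are assumptions chosen for illustration only.

```python
from collections import Counter


def embedding_metrics(guest_edges, node_map, path_map):
    """Return (dilation, congestion, load) of an embedding.

    guest_edges: list of guest edges (u, v)
    node_map:    dict mapping each guest node to a host node
    path_map:    dict mapping each guest edge to the list of host nodes
                 on its image path, endpoints included
    """
    # Load: the largest number of guest nodes sharing one host node.
    load = max(Counter(node_map.values()).values())

    dilation = 0
    edge_use = Counter()
    for e in guest_edges:
        u, v = e
        path = path_map[e]
        # The image path must join the images of the two endpoints.
        assert path[0] == node_map[u] and path[-1] == node_map[v]
        dilation = max(dilation, len(path) - 1)
        for a, b in zip(path, path[1:]):
            edge_use[frozenset((a, b))] += 1  # host edges are undirected

    congestion = max(edge_use.values(), default=0)
    return dilation, congestion, load


# A 4-node linear array embedded one-to-one along four host nodes 0..3.
guest_edges = [(0, 1), (1, 2), (2, 3)]
node_map = {0: 0, 1: 1, 2: 2, 3: 3}
path_map = {e: [e[0], e[1]] for e in guest_edges}
print(embedding_metrics(guest_edges, node_map, path_map))  # (1, 1, 1)
```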

As another simple example, consider the more interesting problem of emulating
a butterfly on a linear array. We will prove that the class of butterflies
cannot be real-time emulated by the class of linear arrays. (This should come
as no surprise, although the proof is not entirely trivial.) However, there
is a simple work-preserving emulation of the class of butterflies by the
class of linear arrays with slowdown $2^N$. In particular, consider an
$N2^N$-node butterfly with nodes and edges

$V = \{(i,w) \mid 1 \le i \le N,\ w \in \{0,1\}^N\}$, and

$E = \{((i,w),(i',w')) \mid i' = i+1,\ w' = w \text{ or } w' = w^{(i)}\}$,

where $w^{(i)}$ denotes $w$ except that the $i$th bit is changed. Then by
mapping the $2^N$ nodes of the form $(i,w)$ (where $w \in \{0,1\}^N$) to the
$i$th node of the linear array, an $N$-node linear array can emulate an
$N2^N$-node butterfly with $2^N$ slowdown.
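
The level-by-level mapping can be checked mechanically on a small instance.
The sketch below assumes the butterfly has $N$ levels and $N$-bit row labels,
as in the definition above; the function names are hypothetical. It builds
the butterfly, maps node $(i,w)$ to array node $i$, and reports the load, the
dilation, and the number of butterfly edges carried by one array link.

```python
from itertools import product


def butterfly(N):
    """Nodes (i, w) with 1 <= i <= N and w in {0,1}^N; edges join levels i, i+1."""
    words = list(product((0, 1), repeat=N))
    nodes = [(i, w) for i in range(1, N + 1) for w in words]
    edges = []
    for i in range(1, N):
        for w in words:
            # w with its i-th bit (1-indexed) changed
            flipped = w[:i - 1] + (1 - w[i - 1],) + w[i:]
            edges.append(((i, w), (i + 1, w)))
            edges.append(((i, w), (i + 1, flipped)))
    return nodes, edges


def level_map(node):
    """Send butterfly node (i, w) to node i of the linear array."""
    return node[0]


N = 3
nodes, edges = butterfly(N)
load = max(sum(1 for v in nodes if level_map(v) == i) for i in range(1, N + 1))
dilation = max(abs(level_map(u) - level_map(v)) for u, v in edges)
congestion = max(sum(1 for u, v in edges if level_map(u) == i) for i in range(1, N))
print(len(nodes), load, dilation, congestion)  # 24 8 1 16
```

With load $2^N$, dilation 1, and congestion $2 \cdot 2^N$, each array node
has $O(2^N)$ work per guest step, which matches the $2^N$ slowdown stated
above.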

Seeing this elementary example, one is tempted to ask if there are faster
work-preserving emulations of a butterfly on a linear array. In other words,
can we emulate a smaller butterfly (say with polynomial blowup) in a
work-preserving fashion on a linear array? Although the proof is not obvious,
the answer is no. There is no polynomial-time work-preserving emulation of
the class of butterflies by the class of linear arrays. Any such emulation
requires exponential slowdown. Alternatively, we might wonder if a linear
array can emulate any bounded-degree network in a work-preserving fashion
given enough slowdown. Again, the answer is no. Although the linear array can
emulate a butterfly in a work-preserving fashion, it cannot emulate any
expander, no matter how much blowup is allowed. In fact, by combining these
results we can conclude that even a butterfly is not sufficiently powerful to
emulate an expander in a work-preserving fashion.

We also consider emulations that are not work-preserving. Such emulations are
(by definition) inefficient, and we define the inefficiency of such an
emulation to be $I = W_H/W_G$. In these terms, an emulation is
work-preserving if it has constant inefficiency. Many of our bounds will
reflect tradeoffs between slowdown and inefficiency. In general,

$I = S/C$,

where $C = N_G/N_H$ is the contraction of an emulation.

1.2  The  motivation

There are several good reasons for studying the problem of emulating one
network on another in a work-preserving fashion. For starters, this kind of
analysis gives us an excellent means by which to compare the computational
power of one network relative to that of another. More importantly, it gives
us an automatic way to compile and run algorithms designed for one kind of
parallel architecture without loss of efficiency on another. This is
provided, of course, that the ratio of the size of the problem to the size of
the machine is large enough. For example, we have already seen that a small
linear array (which has a very simple structure) is just as efficient in
terms of work as a very large butterfly (which has a more complicated
structure).

More generally, the study of work-preserving emulations lies at the heart of
efficient parallel computing. Indeed, one of the central problems in
efficient parallel computing is the task of mapping a collection of processes
linked by precedence and/or communication constraints onto the processors and
routing network of a parallel machine so that

1.  the processing load imposed on the processors is balanced,

2.  the communication between processors can be handled efficiently, and

3.  the computation and communication can be scheduled so that the necessary
inputs for a process are available where and when the process is scheduled to
be computed.

In other words, we would like to schedule the communication and computation
in a way that takes maximum advantage of the available hardware to minimize
the completion time of the job.

In general, we can model the computation to be performed by a DAG. Each node
of the DAG represents a process and each directed edge $(u,v)$ represents a
communication that must take place between $u$ and $v$. Typically, this
communication represents data output from $u$ after $u$ is completed which is
to be input to $v$ before $v$ is started. The parallel machine can be modeled
as an undirected network. The nodes of the network correspond to processors,
and the edges correspond to communication links between processors (and/or
their associated memories). The implementation of the computation to be
performed on the parallel machine then corresponds to an embedding of the DAG
in the network so that nodes of the DAG are mapped to nodes of the network
and so that edges of the DAG are mapped to paths in the network. We may also
need to construct a schedule that specifies the communication and computation
of the DAG that is being performed during each step of the network. This will
be particularly important if the parallel machine is synchronous.

In many applications, the DAG possesses a very natural structure. For
example, typical DAGs encountered in practice are derivatives of a binary
tree, array, butterfly, or shuffle-exchange graph. This is often due to the
fact that the DAG is associated with an algorithm whose inherent underlying
structure is a tree or array (as is the case for many problems in numerical
analysis and linear algebra) or a butterfly or shuffle-exchange graph (as is
the case for Fourier Transform and data manipulation problems).
Alternatively, it could be that the DAG was constructed from an algorithm
specifically designed for use on one of these common parallel architectures.

Similarly, parallel networks also tend to be very naturally structured and
typically are configured as trees, arrays, butterflies, and the like. Hence,
the mapping problem often consists of emulating $T_G$ steps of one
$N_G$-node network (represented as a $T_G N_G$-node DAG) on an $N_H$-node
network with a different structure. Ideally, we would like to perform the
computation in $O(T_G N_G / N_H)$ steps, which is precisely the problem of
finding a work-preserving emulation of one network on another.

In practice, the guest network can be substantially larger than the host
network. For example, it is not uncommon for a parallel machine with between
8 and 256 processors to be emulating array-based computations involving
hundreds of thousands of data points. In such examples, even work-preserving
emulations with exponential slowdown may be within the scope of practicality.
Indeed, the most important feature of the computation is that it be
work-preserving. In fact, the notion of a work-preserving computation is
important enough that it transcends high-level architectural issues such as
SIMD vs. MIMD, synchronous vs. asynchronous, small scale vs. large scale, and
fine grain vs. coarse grain. For example, even though issues involving the
timing of computations and communications become muddied with asynchronous
architectures, the underlying problem of embedding the computation so as to
minimize computational load and communication load (independent of timing)
still remains. As a consequence, work-preserving emulations are just as
important for a Dataflow Machine as they are for a Connection Machine (to
mention two architectures at opposite ends of the spectrum).

1.3  A  closer  look  at  the  computational  model 

If we can find an embedding of a graph $G$ into a graph $H$ with constant
dilation, congestion, and load, then it is fairly clear that $H$ can emulate
$G$ with constant slowdown. Is the reverse true? Somewhat surprisingly, it is
not. For example, Bhatt, Chung, Hong, Leighton and Rosenberg [2] proved that
any embedding of an $N$-node mesh into an $N$-node butterfly with constant
load requires dilation $\Omega(\log N)$, the worst possible. At first glance,
it might seem that this result implies that there is no real-time emulation
of a mesh on a butterfly. As we show in this paper, however, this is not the
case. There is, in fact, a way to emulate $T$ steps of an $N$-node mesh
computation in $O(T)$ steps on an $N$-node butterfly for any $T$.

In order to understand how such a contradictory result is possible, we need
to take a closer look at what it means to emulate $T_G$ steps of one network
in $T_H$ steps on another. We start by modeling the computation performed by
the guest network $G$ as a pebble DAG $\Gamma$. In particular, we will have a
pebble for every node-time pair $(v,t)$ where $v$ is a node of $G$ and
$0 \le t \le T_G$. (Pairs of the form $(v,0)$ correspond to inputs.) In fact,
we may have many pebbles associated with a single pair $(v,t)$, which will
correspond to the same computation being done more than once. (This is the
trick that allows us to emulate a mesh on a butterfly in real time.) To
compute any pebble labeled $(v,t)$, we need as inputs pebbles labeled
$(v,t-1)$ and $(v_1,t-1), (v_2,t-1), \ldots, (v_k,t-1)$, where
$v_1, v_2, \ldots, v_k$ are the neighbors of $v$ in $G$. We use the directed
edges of $\Gamma$ to denote this dependence in the usual way.
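
A minimal sketch of the simplest such DAG, with exactly one pebble per
node-time pair and no redundant copies, might look as follows. The
adjacency-list representation and the name `pebble_dag` are assumptions made
for illustration; DAGs with repeated labels are not modeled here.

```python
def pebble_dag(adj, t_guest):
    """Map each pebble (v, t), 1 <= t <= t_guest, to its predecessor pebbles.

    adj: dict mapping each guest node to the list of its neighbors in G.
    Pebbles (v, 0) are inputs and therefore have no entry of their own.
    """
    preds = {}
    for t in range(1, t_guest + 1):
        for v, neighbors in adj.items():
            # (v, t) depends on v itself and on every neighbor at time t - 1.
            preds[(v, t)] = [(v, t - 1)] + [(w, t - 1) for w in neighbors]
    return preds


# Guest: a 3-node linear array 0 - 1 - 2, emulated for 2 steps.
adj = {0: [1], 1: [0, 2], 2: [1]}
dag = pebble_dag(adj, 2)
print(dag[(1, 2)])  # [(1, 1), (0, 1), (2, 1)]
print(len(dag))     # 6
```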

Because many pebbles can have the same label, there are many DAGs $\Gamma$
associated with any graph $G$. In order to emulate $G$ on $H$, we only need
to find an embedding and an accompanying schedule of one of these DAGs in
$H$. Once an embedding and schedule of a DAG is fixed, the emulation proceeds
in a standard way. In particular, during each step of the computation, a node
of $H$ can

1.  make a copy of a single pebble that it contains,

2.  send a single pebble to a neighbor, and/or

3.  create a pebble with label $(v,t)$ provided that it already contains
input pebbles with labels $(v,t-1)$ and
$(v_1,t-1), (v_2,t-1), \ldots, (v_k,t-1)$.

Initially, we will allow a node of $H$ to have access to any input, although
to use any of these inputs in a meaningful way will take time. By the end of
the emulation, we must have computed pebbles with all labels of the form
$(v, T_G)$. (For purposes of simplicity, we will use a pebble to denote the
state of a processor of $G$ at some particular time, as described above. A
more general interpretation would be to use a pebble to denote one of many
items (e.g., data and/or functions) stored within a processor. All of our
results hold under the more general interpretation, although some of the
emulation results become more complicated.)

By allowing several pebbles to have the same label, we dramatically increase
the number of possible computation DAGs $\Gamma$ that correspond to a
$T_G$-step computation of $G$. This makes it more likely that we can find a
computation that can be efficiently emulated on some host network $H$ (e.g.,
as is the case with emulating a mesh on a butterfly), but it also makes the
task of proving lower bounds much more difficult. For example, in order to
prove that $H$ cannot emulate $G$ in real-time, we must show that for some
$T_G$, there is no DAG $\Gamma$ associated with a $T_G$-step computation of
$G$ that can be emulated in $O(T_G)$ steps on $H$. This can be a formidable
task since $\Gamma$ can look very different from $G$. Indeed, at the very
least, we must choose $T_G$ to be large, since by allowing redundant
computations of pebbles, any $O(1)$ steps of any $N$-node bounded-degree
graph $G$ can be computed in $O(1)$ steps on any $N$-node graph $H$. (This is
because if $T_G = O(1)$, then any output pebble can only depend on $O(1)$
input pebbles, which can be redundantly computed locally since every node of
$H$ is assumed to have access to all input pebbles.)

Note that when we prove a lower bound on the ability of a graph $H$ to
emulate a graph $G$, it does not necessarily mean that $H$ cannot effectively
compute the same result as does $G$ (possibly by using a different algorithm,
for example). Rather, we are proving lower bounds on the ability of $H$ to
perform the same step-by-step computations as $G$ when $G$ is used in a
general purpose way. Hence the term emulation. We suspect that our pebbling
model is probably the most general model in which we could hope to prove
lower bounds.

Throughout the paper we will make use of the fact that if there is an
embedding of $G$ in $H$ with congestion $c$, dilation $d$, and load $l$, then
there is an emulation of $G$ by $H$ with slowdown $O(l + c + d)$. This
follows for any $H$ from the construction in [11]. When $H$ is an array,
tree, butterfly, or shuffle-exchange graph, the schedule can be computed
on-line using the randomized routing algorithm in [11].

1.4  Our  results 

The technical portion of this paper is divided into five sections. We
commence in Section 2 with some general techniques for establishing the
existence or nonexistence of a work-preserving emulation. In particular, we
describe two general methods for proving lower bounds on the slowdown of a
work-preserving emulation. The first method is based on dilation
considerations and appears in Section 2.1. As an application of this method,
we prove that any class of low diameter networks (such as complete binary
trees) cannot be emulated in real time on any class of networks that has poor
expansion properties (such as arrays of bounded dimension).

The second method is based on congestion properties and is presented in
Section 2.2. Here we describe a general method for proving that a
work-preserving emulation requires a large amount of time, or that it is
impossible altogether. As an example, we prove that any work-preserving
emulation of a butterfly on an array of bounded dimension requires
exponential time, and that it is not possible to emulate an expander on a
butterfly in work-preserving fashion. These results provide a curious
contrast between the power of a linear array, a butterfly, and an expander.
By most standards, it would seem that a butterfly is closer in power to an
expander than it is to a linear array. Yet a linear array can emulate a
butterfly in a work-preserving fashion, but a butterfly (or most any
non-expander) cannot emulate an expander in a work-preserving fashion.

In Sections 3-6 of the paper, we focus on the special case of emulations on
arrays, complete binary trees, butterflies, and shuffle-exchange graphs,
respectively. In Section 3, we prove tight bounds on the slowdown required
for an array to emulate a tree, array or butterfly. In Section 4, we prove
that there is a work-preserving emulation of bounded-degree trees by complete
binary trees with $O(\log \log N)$ slowdown. We also give evidence, but no
proof, that there is no corresponding real-time emulation for this class.
(Proving that a complete binary tree cannot emulate a complete ternary tree
in real-time is one of several challenging questions left open in this
paper.)

In Section 5, we show that the class of arrays with bounded dimension can be
emulated in real-time on a butterfly. This result is interesting because any
one-to-one embedding of an array (with dimension 2 or more) in a butterfly
requires $\Omega(\log N)$ dilation [2], which suggests that a real-time
emulation is not possible. The result takes on added significance given the
fact that many parallel numerical algorithms are array-based while several
parallel machines are butterfly-based.

We also describe a simple constant-congestion embedding of an $N$-node
shuffle-exchange graph in an $N$-node butterfly in Section 5. This result has
several important consequences. First, it can be used to provide an
elementary proof that the $N$-node shuffle-exchange graph can be laid out in
$O(N^2/\log^2 N)$ area and in $O(N^{3/2}/\log^{3/2} N)$ volume. Both results
are optimal. The area bound was known previously [7], but the proof was much
more difficult (as were the proofs for several nonoptimal layouts for the
shuffle-exchange graph [6, 10, 12, 19]). The 3-d layout bound is new and was
not obtainable by any of the previous approaches to the 2-d layout problem.
Second, we apply the result to derive an $O(\log N)$-slowdown work-preserving
emulation of the shuffle-exchange graph on the butterfly.

In Section 6, we prove the reverse, namely, that there is an
$O(\log N)$-slowdown work-preserving emulation of the butterfly on the
shuffle-exchange graph. Taken together, these results come very close to
resolving a long open question concerning whether or not the butterfly and
shuffle-exchange graph are computationally equivalent. In particular, we show
that up to NC emulations, the butterfly and shuffle-exchange graphs are
equivalent in a work-preserving sense. Thus, for many problems, they can be
considered to be computationally equivalent.

As a consequence of the emulations in Section 6, we also obtain a real-time
emulation of bounded-degree arrays in the shuffle-exchange graph, and we show
how to sort $N$ numbers with high probability in $O(\log N)$ steps on an
$N$-node shuffle-exchange graph. Although the proof of the sorting bound is
elementary, it resolves an open question concerning the difficulty of
randomized sorting algorithms on the shuffle-exchange graph. Previously, such
an algorithm was known for the butterfly [11, 15, 17], but that algorithm
made crucial use of the recursive structure of the butterfly, a structure not
present in a shuffle-exchange graph.

1.5  Previous  work 

There has been a great deal of previous work on graph embeddings with the
intent of showing that one network can or cannot emulate another network
efficiently [2, 3, 4, 5, 11, 16]. Many of the results were positive and
proved things like "all $N$-node binary trees can be emulated in constant
time on an $N$-node hypercube." There were also some negative results, but
because of the lack of a good model, their significance is now less clear.
For example, even though an embedding of a mesh into a butterfly requires
dilation $\Omega(\log N)$, we now find that a butterfly can emulate a mesh
with constant slowdown.

The notion of work-preserving emulations in PRAM models has previously been
studied [8, 13] and served to motivate this work. Related problems of
scheduling computations on fixed-connection networks have also been
studied [14].

2  Lower  bounds 

In this section we present lower bounds on slowdown and inefficiency. Loosely
speaking, these lower bounds apply when the guest graph expands faster than
the host graph. The first lower bound can be used to show that any emulation
of a complete binary tree by a linear array has slowdown
$\Omega(N_H/\log N_H)$. The second can be used to show that a butterfly
cannot perform a work-preserving emulation of an expander graph, that any
work-preserving emulation of a butterfly by a linear array $H$ requires
slowdown at least $2^{\Omega(N_H)}$, and that any work-preserving emulation
of a $(k+1)$-dimensional mesh by a $k$-dimensional mesh $H$ requires slowdown
at least [...]. All of these lower bounds on slowdown are tight.

Before proving the lower bounds, we need to introduce some notation. For an
undirected graph $G = (V,E)$, let $\delta(u,v)$ be the length (number of
edges) of the shortest path between nodes $u$ and $v$ in $G$. Let
$B_G(u,i) = \{v \in V \mid \delta(u,v) \le i\}$ be the set of nodes within a
distance $i$ of $u$ in $G$ and let $b_G(u,i) = |B_G(u,i)|$. We call $b_G$ the
growth function of $G$.
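
The growth function can be computed by breadth-first search. The following
sketch assumes an adjacency-list representation and uses the hypothetical
name `ball_sizes`; it returns the sizes of the balls of radius 0 through a
chosen maximum around one node and contrasts a linear array with a complete
binary tree.

```python
from collections import deque


def ball_sizes(adj, u, radius):
    """Return [b(u,0), b(u,1), ..., b(u,radius)], where b(u,i) = |B(u,i)|."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue  # nodes beyond the requested radius are not needed
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    # Count nodes at each exact distance, then accumulate into ball sizes.
    exact = [0] * (radius + 1)
    for d in dist.values():
        exact[d] += 1
    sizes, total = [], 0
    for count in exact:
        total += count
        sizes.append(total)
    return sizes


# Linear array on 7 nodes, measured from the middle node.
array = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
print(ball_sizes(array, 3, 3))  # [1, 3, 5, 7]

# Complete binary tree on 7 nodes (heap numbering), measured from the root.
tree = {i: [] for i in range(1, 8)}
for i in range(2, 8):
    tree[i].append(i // 2)
    tree[i // 2].append(i)
print(ball_sizes(tree, 1, 2))   # [1, 3, 7]
```

The array's balls grow linearly in the radius while the tree's grow
exponentially, which is the kind of gap the lower bounds below exploit.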


2.1  Distance-based  lower  bound

The  following  theorem  shows  that  if  the  guest  graph 
grows  faster  than  the  host  graph,  then  any  emulation  of 
the  guest  by  the  host  must  be  slow. 

Theorem 1  Let $H = (V_H, E_H)$ be an $N_H$-node host graph and
$G = (V_G, E_G)$ be an $N_G$-node guest graph, and suppose that there are
integers $\tau_H$ and $\tau_G$ such that

$\max_{u \in V_H} \sum_{i=1}^{\tau_H} b_H(u,i) < \min_{v \in V_G} \sum_{j=1}^{\tau_G} b_G(v,j)$.

Then any emulation of $T_G \ge \tau_G$ steps of $G$ by $H$ has slowdown

$S \ge (\tau_H + 1)/2\tau_G$.

Proof: The basic idea is to find a sequence of $T_G/\tau_G$ pebbles in any
$T_G$-step pebble DAG of $G$ such that each consecutive pair of pebbles is
separated by at most $\tau_G$ guest time steps but is created in $H$ at least
$\tau_H$ host time steps apart. As we shall see, such a sequence exists only
if the slowdown $S = T_H/T_G$ is at least $(\tau_H + 1)/2\tau_G$.

We start the sequence with the last pebble created by $H$. Suppose that at
time $T_H$ some node $u_0 \in V_H$ creates a pebble for DAG node $(v_0,t_0)$,
where $t_0 = T_G$. The pebble for $(v_0,t_0)$ cannot be created by $H$ until
pebbles for all of its predecessors in the DAG are created. In particular,
there are at least $\sum_{j=1}^{\tau_G} b_G(v_0,j)$ predecessors for time
steps $t_0 - \tau_G$ through $t_0 - 1$. We want to show that the pebble for
at least one of these predecessors must have been created by the host graph
before time $T_H - \tau_H$. The pebble for every predecessor of $(v_0,t_0)$
that is created at distance $i$ from $u_0$ in $H$ must be created at or
before time $T_H - i$. Thus at most $\sum_{i=1}^{\tau_H} b_H(u_0,i)$ pebbles
for predecessors of $(v_0,t_0)$ are created by $H$ between time steps
$T_H - \tau_H$ and $T_H - 1$. Since
$\max_{u \in V_H} \sum_{i=1}^{\tau_H} b_H(u,i) < \min_{v \in V_G} \sum_{j=1}^{\tau_G} b_G(v,j)$,
the pebble for some predecessor $(v_1,t_1)$, $t_1 \ge T_G - \tau_G$, must be
created by the host graph at or before time $T_H - (\tau_H + 1)$.

We can repeat the argument to find a pebble for a
predecessor $(v_2, t_2)$, $t_2 \ge T_G - 2\tau_G$, of $(v_1, t_1)$ that must
be created by the host at or before time $T_H - 2(\tau_H + 1)$,
and so on. Eventually we obtain a pebble $(v_k, t_k)$ such
that $\tau_G \ge t_k \ge T_G - k\tau_G$. This pebble must be created
by the host at or before time $T_H - k(\tau_H + 1)$. We assume
that input pebbles are created at host time step 0, and
that the emulation begins with time step 1. Thus,
$T_H - k(\tau_H + 1) \ge 0$. Combining these inequalities, we have

$$T_H/T_G \ge (\tau_H + 1)/2\tau_G$$

for $T_G \ge \tau_G$. □
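
The final combining step of the proof above can be spelled out as follows. This is only a sketch of one way to read it: the case split on $T_G$ is our own, and it assumes $k \ge 1$, which holds because $T_G \ge \tau_G$ allows the argument to be applied at least once.

```latex
\begin{align*}
t_k \le \tau_G \ \text{and}\ t_k \ge T_G - k\tau_G
  &\implies k \ge \frac{T_G}{\tau_G} - 1, \\
T_H - k(\tau_H + 1) \ge 0
  &\implies T_H \ge k(\tau_H + 1), \\
\text{if } T_G \ge 2\tau_G:\quad
  k \ge \frac{T_G}{\tau_G} - 1 \ge \frac{T_G}{2\tau_G}
  &\implies \frac{T_H}{T_G} \ge \frac{\tau_H + 1}{2\tau_G}, \\
\text{if } \tau_G \le T_G < 2\tau_G:\quad
  k \ge 1
  &\implies \frac{T_H}{T_G} \ge \frac{\tau_H + 1}{T_G} > \frac{\tau_H + 1}{2\tau_G}.
\end{align*}
```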

Corollary 2 Any such emulation has inefficiency

Corollary 3 Any emulation of a complete binary tree,
G, by a k-dimensional mesh, H, has slowdown at least

$$\Omega\left((N_G/\log^k N_G)^{1/(k+1)}\right).$$

Proof: Apply Theorem 1 with $\tau_G = \Theta(\log N_G)$ and
$\tau_H = \Theta((N_G \log N_G)^{1/(k+1)})$. □
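
The bound in Corollary 3 follows from Theorem 1 by direct substitution. As a sketch, taking $\tau_H/\tau_G$ as the order of the slowdown and using the parameter choices stated in the proof:

```latex
\[
\frac{\tau_H}{\tau_G}
 = \Theta\!\left(\frac{(N_G \log N_G)^{1/(k+1)}}{\log N_G}\right)
 = \Theta\!\left(N_G^{1/(k+1)} (\log N_G)^{-k/(k+1)}\right)
 = \Theta\!\left(\left(\frac{N_G}{\log^k N_G}\right)^{1/(k+1)}\right).
\]
```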


2.2  Congestion-based  lower  bound 

The second lower bound requires a little more notation.
Let G = (V, E) be an undirected graph as before. For
a set $U \subseteq V$, we define the i-neighborhood of U to be
the set of nodes within a distance i of some node in
U, $N_i(U) = \bigcup_{u \in U} B_G(u, i)$. We define an $(R, f(R))$-
decomposition of G to be a partition of V into $|V|/R$
sets of nodes (regions) such that each contains R nodes
and has a 1-neighborhood of size at most $f(R)$.
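
The two definitions above can be made concrete with a small sketch. The function names, the adjacency-list representation, and the linear-array example are our own illustrative assumptions; the sketch also reads "1-neighborhood" literally, so the neighborhood of a region includes the region itself.

```python
from collections import deque


def neighborhood(adj, U, i):
    """Return N_i(U): every node within distance i of some node in U."""
    dist = {u: 0 for u in U}
    queue = deque(U)
    while queue:
        x = queue.popleft()
        if dist[x] == i:
            continue  # do not expand past distance i
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)


def linear_array(n):
    """Adjacency lists of an n-node linear array (hypothetical helper)."""
    return {v: [u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)}


def is_decomposition(adj, regions, R, f_R):
    """Check that regions partition the nodes into sets of R nodes whose
    1-neighborhoods each have size at most f_R."""
    nodes = [v for region in regions for v in region]
    covers = sorted(nodes) == sorted(adj)
    return covers and all(
        len(region) == R and len(neighborhood(adj, region, 1)) <= f_R
        for region in regions
    )
```

For example, cutting a 12-node linear array into three contiguous blocks of 4 nodes gives a (4, 6)-decomposition under this reading, since each block has at most two outside neighbors.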

The last graph parameter that we need, $z_G$, is best
described in terms of a simple game. The player starts
by choosing $\alpha$ nodes of a connected graph G and placing
them in a bag. The player is given a collection of $\epsilon\alpha$,
$0 \le \epsilon \le 1$, tokens to play with. The game is played in
rounds, each consisting of two steps. In the first step,
all of the neighbors of the nodes in the bag are added to
the bag. In the second step, the player may exchange
tokens for nodes in the bag on a one-for-one basis. Let
$X_i$ be the set of nodes in the bag at the end of round
i, and let $Y_i$ be the set of nodes removed in the second
step of round i. Then $X_i$ is given by the recurrence
$X_i = N_1(X_{i-1}) - Y_i$. The game ends when the number
of nodes in the bag exceeds its capacity, c, at the end
of a round, where $c \le N_G$. If k is the number of rounds
played, then $|X_i| \le c$ for $i < k$, $|X_k| > c$ for $i = k$,
and $\sum_{i=1}^{k} |Y_i| \le \epsilon\alpha$. The goal is to play as many rounds
as possible. Let $z_G(\alpha, \epsilon, c)$ be an upper bound that is
non-increasing in $\alpha$ on the length of the longest possible
game.
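
As a small illustration, the sketch below plays the game with no tokens ($\epsilon = 0$), so the only choice is the starting set. The function name and the path-graph example are our own assumptions. Note that $z_G$ is an upper bound over all starting sets and token strategies, which this sketch does not compute; it only measures the game length for one given start.

```python
def rounds_without_tokens(adj, start, c):
    """Number of rounds played when the bag starts as `start`, no tokens
    are available, and the game ends once the bag holds more than c nodes."""
    bag = set(start)
    rounds = 0
    while len(bag) <= c:
        # First step of a round: add all neighbors of nodes in the bag.
        grown = bag.union(*(adj[x] for x in bag))
        if grown == bag:
            raise ValueError("the bag can never exceed capacity c")
        bag = grown
        rounds += 1
    return rounds


# A 16-node linear array as adjacency lists (illustrative guest graph).
path = {v: [u for u in (v - 1, v + 1) if 0 <= u < 16] for v in range(16)}
```

On the 16-node path with capacity c = 8, starting from an end node lasts 8 rounds, while starting from a middle node lasts only 4, which is why the choice of starting set matters to the player.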


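Theorem 4 Suppose that $H = (V_H, E_H)$ is an $N_H$-
node host graph with an $(R, f(R))$-decomposition, and
that $G = (V_G, E_G)$ is an $N_G$-node guest graph. Let

$$\beta = \max\left\{ z_G\!\left(\frac{3N_GR}{8N_H}, \frac{1}{2}, \frac{N_G}{2}\right),\ z_G\!\left(\frac{N_G}{4}, 0, \frac{3N_G}{4}\right) \right\}.$$

Then for any emulation of G by H where $T_G \ge 3\beta$,

$$I \ge \min\left( \frac{R}{32\beta f(R)},\ \frac{N_H}{96R} \right).$$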
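
To get a feel for how the two terms of the bound in Theorem 4 trade off, the sketch below evaluates $\min(R/32\beta f(R),\ N_H/96R)$ for given parameter values. The function name and the sample numbers are our own assumptions; the theorem itself is asymptotic, so the concrete values are illustrative only.

```python
def theorem4_bound(R, f_R, beta, N_H):
    """Evaluate the lower bound on inefficiency from Theorem 4."""
    importer_case = R / (32 * beta * f_R)  # time spent crossing region perimeters
    creator_case = N_H / (96 * R)          # time spent creating pebbles in one region
    return min(importer_case, creator_case)
```

Increasing R strengthens the importer term and weakens the creator term, which is why the corollaries pick R to balance the two.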


Proof: The basic strategy is to show that either
the host spends a lot of time passing pebbles
across the perimeters of the regions in the $(R, f(R))$-
decomposition, or the host spends a lot of time creating
pebbles. We will break the $T_G$ guest time steps into
blocks of $3\beta$ consecutive steps and classify every block
as either an importer or a creator. If a block is an
importer, then many pebbles for the block cross region
perimeters. If a block is a creator, then some region
creates many pebbles for the block. If the majority of
the blocks are importers, then the time required by the
host to pass pebbles across the perimeters of the
regions is large. Otherwise, the time required to create the
pebbles is large.

Before we can get started we need one more piece of
notation. For each node v in G there is at least one
pebble created by H for each guest time step t between
1 and $T_G$. The first pebble created for v for time t is
called the t-primary pebble for v. For each value of t
there are exactly $N_G$ t-primary pebbles. The t-primary
pebbles are ordered according to the order in which they
are created by H, with ties broken arbitrarily. We call
the first $3N_G/4$ t-primary pebbles the t-early pebbles
and the last $3N_G/4$ the t-late pebbles.

We begin with the definition of an importer block.
Consider a block from step t to $t - 3\beta + 1$. The
average number of t-early pebbles created by each of the
$N_H/R$ regions in the decomposition of H is at least
$p = 3N_GR/4N_H$. We say that a region is t-busy if it
creates at least p/2 t-early pebbles. We say that a
t-early pebble is t-busy if it is created by a t-busy region.
At least half of the t-early pebbles are t-busy. Thus,
there are at least $3N_G/8$ t-busy pebbles. Suppose that
a t-busy region creates $s \ge p/2$ t-busy pebbles. We say
that the region is an importer if it imports at least s/2
pebbles for time steps between $t - 1$ and $t - 2\beta$. We
say that a block is an importer if every t-busy region is
an importer, or if some region imports at least $3N_G/16$
pebbles for time steps between $t - 1$ and $t - 2\beta$. In an
importer block, a total of at least $3N_G/16$ pebbles for
time steps between $t - 1$ and $t - 2\beta$ are imported by all
of the regions.

If at least half of the $T_G/3\beta$ blocks are importers,
then we can find a lower bound on inefficiency by
computing the time required to import pebbles. In this
case, the total number of pebbles imported by all of
the importer blocks is at least $T_GN_G/32\beta$. The host
time required to import these pebbles is at least
$T_H \ge T_GN_GR/32\beta N_Hf(R)$, because at each host time step,
each of the $N_H/R$ regions can import at most $f(R)$
pebbles. In this case,

$$I \ge R/32\beta f(R).$$

As we shall see, if a block is not an importer then
some region must create many pebbles for the block.
Hence the name creator. In a creator block there must
be some t-busy region $\mathcal{R}$ that creates $s \ge p/2$ t-busy
pebbles but imports fewer than s/2 pebbles for time
steps between $t - 1$ and $t - 2\beta$. The t-busy pebbles
created by $\mathcal{R}$ cannot be created until pebbles for all of
their predecessors in the pebble DAG are created. Since
$z_G(s, 1/2, N_G/2) \le z_G(p/2, 1/2, N_G/2) \le \beta$, $\mathcal{R}$ imports
at most s/2 pebbles for time steps between $t - 1$ and
$t - z_G(s, 1/2, N_G/2)$. Thus $\mathcal{R}$ must create at least $N_G/2$
pebbles for time step $t - z_G(s, 1/2, N_G/2)$. Furthermore,
since $\mathcal{R}$ imports at most $3N_G/16$ pebbles for time steps
between $t - 1$ and $t - 2\beta$, it must create at least $5N_G/16$
pebbles for every time step between $t - z_G(s, 1/2, N_G/2)$
and $t - 2\beta$. For each of these time steps, at least $N_G/16$
of the pebbles are created for nodes whose $(t - 2\beta)$-
primary pebbles are $(t - 2\beta)$-late pebbles. We call these
pebbles the descendant pebbles.

We have chosen the descendant pebbles so that
none are created by H until all of the descendant
pebbles for previous blocks have been created. The
early pebbles for all time steps at or before
$t - 2\beta - z_G(N_G/4, 0, 3N_G/4)$ must be created before the $(t - 2\beta)$-
late pebbles because $3N_G/4$ nodes in G lie within a
distance $z_G(N_G/4, 0, 3N_G/4)$ of the nodes corresponding
to the first $N_G/4$ $(t - 2\beta)$-primary pebbles. Since
$z_G(N_G/4, 0, 3N_G/4) \le \beta$, the early pebbles for previous
blocks must be created before the $(t - 2\beta)$-late pebbles.
Furthermore, the $(t - 2\beta)$-late pebbles must be created
before the descendant pebbles, which in turn must be
created before the t-busy pebbles for $\mathcal{R}$.

If at least half of the blocks are creators, then we
can derive a lower bound on inefficiency by summing
the time to create the descendant pebbles for each of
the creator blocks. For each of $T_G/6\beta$ creator blocks,
at least $\beta N_G/16$ descendant pebbles are created by a
single region. The host time for each block is at least
$\beta N_G/16R$. The host time for all of the creator blocks is
at least $T_GN_G/96R$ and the inefficiency is at least

$$I \ge N_H/96R.$$


Combining the two cases proves the theorem. □

Corollary 5 A k-dimensional mesh H cannot perform
a work-preserving emulation of an expander graph G.

Proof: Apply Theorem 4
with $R = \Theta((N_H \log N_H)^{k/(k+1)})$, $f(R) = O(R^{(k-1)/k})$,
and $\beta = O(\log(N_H/R))$. The inefficiency is at least
$I \ge \Omega((N_H/\log^k N_H)^{1/(k+1)})$. □

Corollary 6 A butterfly network H cannot perform a
work-preserving emulation of an expander graph G.

Proof: Apply Theorem 4 with
$R = \Theta(N_H \log\log N_H/\log N_H)$, $f(R) = O(R/\log R)$,
and $\beta = O(\log(N_H/R))$. The inefficiency is at least
$I \ge \Omega(\log N_H/\log\log N_H)$. □

Corollary 7 Any work-preserving emulation of a
butterfly G by a k-dimensional mesh H has slowdown at
least $2^{\Omega(N_H^{1/k})}$.

Proof: Apply Theorem 4 with $R = \Theta((N_H \log N_G)^{k/(k+1)})$,
$f(R) = O(R^{(k-1)/k})$, and $\beta = O(\log N_G)$. The
inefficiency is at least $I \ge \Omega((N_H/\log^k N_G)^{1/(k+1)})$. □

Corollary 8 Any work-preserving emulation of a j-
dimensional mesh G by a k-dimensional mesh H, j > k,
has slowdown at least $\Omega(N_G^{(j-k)/j})$.

Proof: Apply Theorem 4 with $R = \Theta((N_G^{1/j} N_H)^{k/(k+1)})$,
$f(R) = O(R^{(k-1)/k})$, and $\beta = O(N_G^{1/j})$. The inefficiency is at
least $I \ge \Omega((N_H/N_G^{k/j})^{1/(k+1)})$. □

3 Emulations by arrays

Although the arrays cannot perform real-time
emulations of graphs with small diameter, we can show that
they can perform work-preserving emulations of
complete binary trees, other arrays, and butterflies. In each
case, we find an embedding of the guest graph into the
array with acceptable load, congestion, and dilation.
The edges of the guest graph are emulated by routing
packets between the nodes of the array. All of the
following results can be shown to be tight by
Corollaries 3, 8, and 7.

Observation 9 An N-node k-dimensional
mesh can perform a work-preserving emulation of an
$N^{(k+1)/k}/\log N$-node complete binary tree.

Proof: An $N^{(k+1)/k}/\log N$-node complete binary tree
can be embedded in an N-node k-dimensional mesh
with load $O(N^{1/k}/\log N)$, dilation $O(N^{1/k}/\log N)$, and
congestion $O(N^{1/(k+1)})$. □

Observation 10 An N-node k-dimensional mesh can
perform a work-preserving emulation of an $N^{j/k}$-node
j-dimensional mesh, j > k.

Proof: An $N^{j/k}$-node j-dimensional mesh can be
embedded in an N-node k-dimensional mesh with load
$N^{(j-k)/k}$, congestion $N^{(j-k)/k}$, and dilation 1. □

Observation 11 An $N_H = n^k$-node k-dimensional
mesh H can perform a work-preserving emulation of an
$N_G = n2^n$-node butterfly graph G.

Proof: An $n2^n$-node butterfly graph with $2^n$ rows and
n columns can be embedded in an $N_H = n^k$-node k-
dimensional mesh with load $O(2^n/n^{k-1})$, congestion
$O(2^n/n^{k-1})$, and dilation $O(n)$. □

It is interesting to note that every connected network
can perform a real-time emulation of a linear array.
Hence, Observations 9 through 11 can be modified to
hold for all connected networks.

4 Emulations by complete binary trees

4.1 Work-preserving emulations of bounded-
degree trees

In this section, we show that any $N \log\log N$-node
forest with maximum degree $\Delta$ can be embedded in an
N-node complete binary tree with load $O(\Delta \log\log N)$,
congestion $O(\Delta^2 \log\log N)$, and dilation $O(\log \Delta)$. As
a corollary, there is a work-preserving emulation with
slowdown $O(\log\log N)$ of the class of bounded-degree
forests by the class of complete binary trees.

In  constructing  the  embedding,  we  use  the  following 
weighted  separator  theorem  for  forests. 

Theorem 12 Suppose that F = (V, E) is a forest where
each vertex has been assigned some non-negative weight.
Then it is possible to remove a set S of k vertices from
V such that the remaining vertices can be partitioned
into two subforests $F_1$ and $F_2$ such that no edge connects
a vertex in $F_1$ with a vertex in $F_2$, and each contains
at most $|V|(1 + (2/3)^{k/2})/2$ vertices and at most 5/6 of
the total weight.

Proof:  Omitted. 

We begin by using Theorem 12 to find a set S
of $k \in O(\log\log N)$ nodes that partitions the forest
F = (V, E) into two subforests, each containing at most
$|V|(1 + 1/\log N)/2$ vertices. We embed S at the root of
the binary tree and then recursively embed one of the
subforests in the left subtree of the root, and the other
in the right.
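
The choice $k \in O(\log\log N)$ comes from requiring $(2/3)^{k/2} \le 1/\log N$ in Theorem 12. The sketch below finds the smallest such k for a given N; the function name and the use of base-2 logarithms are our own assumptions.

```python
import math


def separator_size(N):
    """Smallest k with (2/3)**(k/2) <= 1/log2(N), for N > 2."""
    target = 1 / math.log2(N)
    k = 0
    while (2 / 3) ** (k / 2) > target:
        k += 1
    return k
```

Squaring log N (for example going from $N = 2^{16}$ to $N = 2^{256}$) only doubles the required k, which is the $\log\log N$ growth used in the text.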

At levels below the root, we use Theorem 12 to
simultaneously partition the vertices of the forest and the
edges connecting the forest to vertices that are
embedded higher in the binary tree. Let $F_i = (V_i, E_i)$ be a
forest to be embedded in a subtree rooted at a level i
node $v_i$ in the binary tree. Let $N_i$ be the number of
edges connecting $F_i$ to vertices embedded higher in the
binary tree; $N_i$ is the congestion of the binary tree edge
connecting $v_i$ to its parent. We assign each vertex of $F_i$
a weight equal to the number of neighbors it has that are
embedded higher in the binary tree. Using Theorem 12,
we find a set $S_i$ of k vertices that partitions $F_i$ into two
subforests, each of size at most $|V_i|(1 + 1/\log N)/2$, and
each having at most $(5/6)N_i$ edges to vertices that are
embedded higher in the tree. We embed the vertices of
$S_i$ at $v_i$ and recursively embed one of the subforests in
the left subtree of $v_i$, and the other in the right subtree.

To limit the dilation to some integer d, whenever i
is a multiple of d we embed at $v_i$ not only $S_i$ but also
all of the vertices in $F_i$ that have at least one neighbor
embedded somewhere higher in the binary tree.

We must now show how to choose d so that both the
congestion and the load of the embedding are small.
Consider any simple path from a level i node $v_i$ in the
binary tree to a level i + d node, where i is a
multiple of d. At level i, we embed a separator of size k and
at most $N_i$ other vertices that have at least one neighbor
embedded higher in the tree. Since each of these
vertices has at most $\Delta$ neighbors, $N_{i+1} \le \Delta k + \Delta N_i$. At
level i + 1, we embed a separator of size k that partitions
$F_{i+1}$ into two subforests, each having at most $(5/6)N_{i+1}$
edges to vertices embedded higher in the binary tree.
Thus, at level i + 2, we have $N_{i+2} \le (5/6)N_{i+1} + \Delta k$.
In general, $N_{i+j}$ is given by the recurrence

$$N_{i+j} \le \begin{cases} \Delta k + \Delta N_i & j = 1 \\ (5/6)N_{i+j-1} + \Delta k & 1 < j \le d \end{cases}$$

Solving the recurrence yields

$$N_{i+j} \le 6\Delta k + (5/6)^{j-1}\Delta N_i.$$
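
The closed form can be checked numerically. The sketch below iterates the recurrence with equality (the worst case) and compares each value with $6\Delta k + (5/6)^{j-1}\Delta N_i$. The function names and the sample parameters are our own assumptions.

```python
def congestion_profile(delta, k, n_i, d):
    """Iterate the recurrence with equality; returns [N_{i+1}, ..., N_{i+d}]."""
    values = [delta * k + delta * n_i]              # j = 1
    for _ in range(2, d + 1):                       # 1 < j <= d
        values.append(5 / 6 * values[-1] + delta * k)
    return values


def closed_form_bound(delta, k, n_i, j):
    """The bound 6*delta*k + (5/6)**(j-1) * delta * n_i on N_{i+j}."""
    return 6 * delta * k + (5 / 6) ** (j - 1) * delta * n_i
```

With, say, delta = 3, k = 5 and N_i = 7 * delta * k, the congestion jumps at level i + 1 and then decays geometrically, falling back below N_i well within d = 20 levels.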


We are now in a position to calculate the load and
the congestion. The preceding argument shows that
for $d \in O(\log \Delta)$ and $N_i \in O(\Delta k)$, we have $N_{i+d} \le N_i$.
Thus, in every simple path between a node at level
i and a node at level i + d, where i is a multiple of
d, the congestion starts at $O(\Delta k)$ at level i, rises to
at most $O(\Delta^2 k)$ at level i + 1 and proceeds to drop
back down to at most $O(\Delta k)$ at level i + d. Thus, the
congestion of the embedding is at most $O(\Delta^2 \log\log N)$.
How large can the load be? At each node of the binary
tree we embed a separator of size k. For every i that
is a multiple of d, we also embed a set of nodes of size
$N_i = O(\Delta k)$. Finally, at the leaves we embed forests of
size $N \log\log N((1 + 1/\log N)/2)^{\log N}$, which is at most
$O(\log\log N)$. Thus the load is at most $O(\Delta \log\log N)$.
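
The bound on the size of the forests at the leaves can be verified directly. The following is a sketch assuming base-2 logarithms, so that $2^{-\log N} = 1/N$, and using the standard fact $(1 + 1/x)^x < e$ for $x > 0$:

```latex
\[
N \log\log N \left(\frac{1 + 1/\log N}{2}\right)^{\log N}
 = N \log\log N \cdot 2^{-\log N} \left(1 + \frac{1}{\log N}\right)^{\log N}
 \le e \log\log N .
\]
```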


4.2  Congestion  lower  bounds  for  a  complete 
ternary  tree 

In this section we show that any embedding of an N-
node complete ternary tree in an N-node complete
binary tree with load at most $O(\sqrt{\log\log N})$ in which the
leaves of the ternary tree are mapped to the leaves of the
binary tree has congestion at least $\Omega(\sqrt{\log\log N})$. This
lower bound suggests, but does not prove, that real-time
emulation of a complete ternary tree by a complete
binary tree is impossible.

Theorem 13 Any embedding of an N-node complete
ternary tree in an N-node complete binary tree with
load at most $O(\sqrt{\log\log N})$ in which the leaves of the
ternary tree are mapped to the leaves of the binary tree
has congestion at least $\Omega(\sqrt{\log\log N})$.

Proof:  Omitted. 


5  Emulations  in  a  butterfly  graph 

Before describing our emulations we give some notation
concerning the butterfly graph. Recall that a
butterfly graph node can be represented by a pair $\langle i, w \rangle$.
We refer to i as the node's level. We refer to w as the
node's position in level (PIL). We consider the nodes of
the butterfly with the same PIL to be in a row. We
consider the inputs of the butterfly to be the nodes whose
representatives are of the form $\langle 0, w \rangle$, i.e., the level
0 node of a row. In the following sections we will
connect the inputs of a butterfly to each other via paths
through the butterfly. We make use of the following
theorem of Benes [1].


Figure 1: Division of the mesh into submeshes

Theorem  14  The  inputs  of  an  N  log  N -node  butterfly 
can  be  connected  in  any  permutation  by  a  set  of  paths 
such  that  each  path  has  length  at  most  2  log  N,  and  each 
edge  in  the  butterfly  is  used  at  most  twice  ( once  in  each 
direction). 

5.1  Work-preserving  emulations  of  binary  trees 

When  the  Bhatt,  Chung,  Hong,  Leighton,  Rosenberg 
result  [2]  that  a  butterfly  can  emulate  a  complete  bi¬ 
nary  tree  in  real-time  is  combined  with  the  material  in 
Section 3, we find that there is an $O(\log\log N)$-time
work-preserving simulation of the class of binary trees
on  the  butterfly.  Whether  or  not  this  emulation  can  be 
performed  in  real-time  remains  an  open  question. 

5.2  Real  time  emulation  of  arrays 

Theorem 15 For constant q, T steps on a $\sqrt[q]{N} \times \cdots \times \sqrt[q]{N}$
q-dimensional mesh can be emulated in O(T) steps
on a butterfly graph with O(N) nodes.

Proof.  We  prove  the  theorem  for  q  =  2;  for  other 
values  of  q  the  proof  is  similar.  We  will  only  prove  the 
theorem  when  T  >  log  N ;  when  T  <  log  N  the  proof  is 
similar. 

We  will  prove  the  theorem  using  recursion.  We  will 
divide  the  mesh  into  submeshes  and  the  butterfly  into 
subbutterflies  and  recursively  emulate  each  submesh  in 
a  subbutterfly.  Since  submeshes  will  need  pebbles  com¬ 
puted  in  other  submeshes,  we  will  create  connections 
between  the  submeshes. 

Suppose that we know how to emulate $f_{k+1}$ steps of an
$s_{k+1} \times s_{k+1}$ mesh on a butterfly with $N_{k+1} = n_{k+1}2^{n_{k+1}}$
nodes, where $n_{k+1}$ is a power of two, and $s_{k+1}^2 = m_{k+1}N_{k+1}$,
where $s_{k+1}$, $f_{k+1}$, $N_{k+1}$, and $m_{k+1}$ are numbers that
will be specified later. To show how to emulate $f_k$ steps
of an $s_k \times s_k$ mesh on a butterfly with $N_k$ nodes, we
first divide an $s_k \times s_k$ mesh into $s_{k+1} \times s_{k+1}$ slightly
overlapping submeshes as shown in Figure 1. Then the
butterfly is partitioned into subbutterflies of size $N_{k+1}$,
and one submesh is assigned to a subbutterfly.

The emulation of the $s_k \times s_k$ mesh will be divided
into $f_k/f_{k+1}$ phases. In each phase we first attempt
to run $f_{k+1}$ steps of the emulation of each submesh in
a subbutterfly. If nothing else were done, any node of
a submesh at distance $\delta$ from the border of the
submesh would not be able to be emulated for more than $\delta$
steps because the pebbles that it computes will depend
on pebbles from another submesh. However, for every
node v on the border of a submesh there is a node v' in
another submesh emulating the same node of the mesh
which will be able to successfully emulate $f_{k+1}$ steps
because it is located at distance $f_{k+1}$ from the border
of the submesh. We will show how to provide a path
in the butterfly between the two nodes in the butterfly
emulating v and v' of length $O(n_k)$. When the node
emulating v' computes v's pebbles it will send copies
of the pebbles to the node emulating v along this path.
Once the node emulating v starts receiving pebbles from
the node emulating v' it will resume the emulation. As
the node emulating v resumes the emulation, nodes that
were emulating nodes of the mesh that were waiting for
pebbles from v will be able to resume their emulation.
In order for all such pairs of nodes to be able to send
pebbles back and forth simultaneously without slowing
down the emulation, it will be necessary to choose the
paths so that at most a constant number of paths will
share an edge, and this must be true simultaneously for
all levels of the recursion. In order to provide the paths
connecting nodes in the butterfly, we will not use all
subbutterflies in the partition of the butterfly for
emulating submeshes; some subbutterflies will be used only
for providing connections between subbutterflies.

We  now  describe  how  to  embed  the  nodes  of  the  mesh 
in  the  butterfly  and  to  choose  the  paths  connecting 
copies  of  nodes. 

So now suppose that we have chosen the embedding
of the nodes of an $s_{k+1} \times s_{k+1}$ mesh in an $N_{k+1}$-node
butterfly and the paths connecting corresponding nodes
within the subbutterfly. We will further require that for
each node v on the border or at distance $f_{k+1}$ from the
border of the $s_{k+1} \times s_{k+1}$ mesh (we will refer to the
set of all such nodes as $F_{k+1}$), there is some node
$\langle 0, x_v \rangle$ of the butterfly and a path that connects
$\langle 0, x_v \rangle$ to a node $\langle i, y_v \rangle$ that emulates v in the
butterfly such that pebbles can be sent between $\langle 0, x_v \rangle$
and $\langle i, y_v \rangle$ without slowing down the simulation of
the $s_{k+1} \times s_{k+1}$ mesh. Furthermore, $x_v$ will have the
property that $b_{\epsilon_{k+1}-1} \cdots b_0$ equals $10 \cdots 0$, where $\epsilon_{k+1}$ is
a number that will be specified later, and for all v in
$F_{k+1}$, their values of $x_v$ will share a common value of
$b_{2\epsilon_{k+1}-1} \cdots b_{\epsilon_{k+1}}$ that can be chosen arbitrarily, where
$b_{n_{k+1}-1} \cdots b_0$ is the binary representation of $x_v$. We
again divide an $s_k \times s_k$ mesh into submeshes as described
in Figure 1. Now, however, we modify this method for
dividing the mesh into submeshes. We wish to require
that all nodes in $F_k$ in the mesh will lie in $F_{k+1}$ for any
submesh in which they are contained. In order to do
this, we will shrink the sizes of submeshes in at most
two rows and two columns of submeshes. When we
recursively emulate $f_{k+1}$ steps of a submesh that has had
its size reduced we will consider it to be part of a larger
mesh that has dummy nodes.

We will now partition the butterfly with $N_k$ nodes
into subbutterflies with $N_{k+1}$ nodes. For a node in a
butterfly we will denote the binary representation of the
node's PIL by $b_{n_k-1} \cdots b_0$. Each subbutterfly will
consist of all nodes $\langle i, w \rangle$ of the butterfly with the following
properties: there exists $\alpha$ such that $\alpha$ is a multiple of $n_{k+1}$
(possibly zero) and such that all nodes in the subbutterfly
share common values of $b_{\alpha-1} \cdots b_0$ and $b_{n_k-1} \cdots b_{\alpha+n_{k+1}}$,
and $\alpha \le i \le \alpha + n_{k+1} - 1$.

Subbutterflies will be used to emulate submeshes.
However, we will not use all subbutterflies to emulate
submeshes; some subbutterflies will be used to create
connections between subbutterflies that will be simulating
submeshes. We will not use a subbutterfly if there
exists $\gamma$ such that $\gamma$ is a multiple of $n_{k+1}$, $\gamma > \alpha$ (where
$\alpha$ is the $\alpha$ used to describe the nodes in the
subbutterfly) and $b_{\gamma+\epsilon_{k+1}-1} \cdots b_\gamma$ equals the string $10 \cdots 0$ for all
nodes in the subbutterfly, or if $\alpha > 0$ and $b_{\epsilon_{k+1}-1} \cdots b_0$
equals the string $10 \cdots 0$ or $0 \cdots 0$.

We must make sure that the number of subbutterflies
to be used for simulating submeshes is greater than or
equal to the number of submeshes to be emulated. The
number of submeshes is at most

$$\left(\frac{s_k}{s_{k+1} - f_{k+1}} + 2\right)^2$$

(the additive two is due to the shrinkage of the size of
some submeshes). The total number of subbutterflies in
the partition of the butterfly is $N_k/N_{k+1}$. The number
of subbutterflies that will not be used for simulating
submeshes is at most

onk-nk+i-fk  +  i 

n»+ 1/ 


Thus  there  will  be  enough  subbutterflies  if 

(-v-+2)’ 

\Sk  +  l-Jk  +  l  ) 

<  (  nk 

~  -Ni+i  \ni+i 
We now describe how to choose the paths. We wish
to choose a path connecting two nodes u and u' that
are emulating nodes v and v'. We will again use $\alpha$
to describe the subbutterfly in which u is located as
we did previously. Since v is in $F_{k+1}$ for its submesh,
we know that there exists some node $u_1$ in the
subbutterfly such that $u_1$'s level in the butterfly is $\alpha$ (it
has level zero when considered as part of the
subbutterfly), and such that the bits $b_{\alpha+\epsilon_{k+1}-1} \cdots b_\alpha$ in $u_1$'s PIL
equal $10 \cdots 0$, and such that for all v in $F_{k+1}$ in the
submesh, the PILs of the respective $u_1$'s share
a common value of $b_{\alpha+2\epsilon_{k+1}-1} \cdots b_{\alpha+\epsilon_{k+1}}$ that can be
chosen arbitrarily. We choose $b_{\alpha+2\epsilon_{k+1}-1} \cdots b_{\alpha+\epsilon_{k+1}}$ to
be $b_{\epsilon_{k+1}-1} \cdots b_0$, which is the same for all nodes in u's
subbutterfly. We now choose $u_2$ to be the node in the
butterfly at level zero whose PIL is obtained by
converting the $\epsilon_{k+1}$ least significant bits of $u_1$'s PIL to $10 \cdots 0$.

By our choice of subbutterflies to be used for
simulating submeshes and our choice of $b_{\alpha+2\epsilon_{k+1}-1} \cdots b_{\alpha+\epsilon_{k+1}}$
for $u_1$'s PIL we know that the paths from $u_1$ to $u_2$ for
different choices of v are disjoint. One similarly chooses
nodes in the butterfly $u'_1$ and $u'_2$ for v'. For all choices
of $u_2$ and $u'_2$ we now choose paths connecting $u_2$ and
$u'_2$ by routing a permutation through nodes of the
butterfly which have PILs whose $\epsilon_{k+1}$ least significant bits
equal $10 \cdots 0$ using one pass up through the butterfly
and one pass down [1]. None of these paths will conflict
with any previously chosen paths.

To finish the description of the embedding, we must
show that for each node v in $F_k$ in the mesh being
emulated, there is a node u in the butterfly that can
be connected by a path of length $O(n_k)$ to some node
w in the butterfly which is emulating v so that pebbles
can be sent from u to w or w to u without slowing down
the simulation of the mesh, and such that u is chosen so
that it has level zero, the $\epsilon_k$ least significant bits of
its PIL equal $10 \cdots 0$, and $b_{2\epsilon_k-1} \cdots b_{\epsilon_k}$ is some
arbitrarily chosen number that is common for all u. We
first assign nodes in the mesh in $F_k$ to nodes in the
butterfly with the required characteristics, so that at most
one node of the mesh is assigned to a node in the
butterfly. For this to be possible there must be enough nodes
in the butterfly with the required properties, and this
will be true if

$$\log 8s_k \le n_k - 2\epsilon_k. \qquad (2)$$

We already know that for a node v in $F_k$, there will
be some node u' in the butterfly with level zero whose
$\epsilon_{k+1}$ least significant bits equal $10 \cdots 0$ which is
connected by a path of length $O(n_k)$ to w; this is because
when we divided the mesh into submeshes, we required
v to be located in $F_{k+1}$ of any submesh in which it
was contained, and we have previously described a path
from w to the desired node u'. We now again connect
all corresponding pairs of u's and u''s using permutation
routing as before.
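
Condition (2) is a counting requirement: there must be at least as many suitable level-zero butterfly nodes as there are mesh nodes in $F_k$. The sketch below counts $F_k$ for a small mesh and tests the condition. The function names are our own, and reading $F_k$ as "on the border, or at distance exactly $f_k$ from it" with base-2 logarithms is an assumption on our part.

```python
import math


def boundary_set_size(s, f):
    """Count nodes of an s x s mesh on the border or at distance exactly f
    from the border (our reading of the set F_k)."""
    count = 0
    for r in range(s):
        for c in range(s):
            dist = min(r, c, s - 1 - r, s - 1 - c)
            if dist == 0 or dist == f:
                count += 1
    return count


def enough_inputs(s, n, eps):
    """Condition (2): log2(8 s) <= n - 2 eps."""
    return math.log2(8 * s) <= n - 2 * eps
```

For a 16 x 16 mesh with f = 3 the set has 96 nodes, which is below 8s = 128; this is consistent with the factor 8s appearing in (2).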

We  now  choose  the  values  of  s*,  /*,  Nk,  ft  and  nu  so 
that  (1)  and  (2)  are  satisfied.  We  first  denote  by  w(AT) 
the  smallest  value  of  k  such  that  rV20”*  <  2.  We  let 
tk  =  30  l°g  nk-  We  let  so  =  y/N,  and  for  k  >  0,  choose 
Sk  and  nk  so  that  N 10  <  s*  <  (AT10  k)2,  m  is  a  power 

of  two  (Nk  =  nt2n*),  and 


w(N) 

Nk  =  si  ]][  mj , 

i  —  k+l 


where 


m 


( 


S*  -  fk 


We  know  that  we  can  choose  such  as*  since  for  all 
possibles  values  of  s,  in  the  specified  range  the  product 

u,(N) 

n 

j=k+ 1 


is  bounded.  We  also  choose  fo  —  min{T,  y/sT}, 

and  for  k  >  2,  /*  =  y/sZ. 

We now consider the time required for the emulation.
Let $T_k$ be the time to emulate $f_k$ steps of an $s_k \times s_k$ mesh
on an $N_k$-node butterfly. The emulation is divided into
$f_k/f_{k+1}$ phases. Each phase requires time $T_{k+1} + O(n_k)$
and $n_k$ is $O(\log s_k)$. Thus

$$T_k = \frac{f_k}{f_{k+1}}\left(T_{k+1} + O(\log s_k)\right)$$

and therefore the total time for the emulation
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5.3  A  constant  congestion  embedding  of  the 
shuffle-exchange  graph  in  a  butterfly 

In this section, we show how to embed an N-node
shuffle-exchange graph in an O(N)-node butterfly graph
with constant congestion and O(log N) dilation.

The N-node shuffle-exchange graph is defined for every
N which is a power of two. Each node of the
($N = 2^k$)-node shuffle-exchange graph is associated
with a unique k-bit binary string $a_{k-1} \ldots a_0$. We call
this string the label of the node. Two nodes, w and w',
are linked via a shuffle edge if w' is a left or right cyclic
shift of w. Two nodes, w and w', are linked via an exchange
edge if w and w' differ in the least significant
bit, $a_0$.
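
The definition above can be illustrated with a minimal Python sketch. The function name and the choice to write labels as strings with the most significant bit first are assumptions made here for illustration; self-loops (which arise for the all-zeros and all-ones labels) are dropped.

```python
def shuffle_exchange_neighbors(label: str) -> dict:
    """Neighbors of a shuffle-exchange node given its k-bit label.

    The label is a string of '0'/'1' characters, most significant bit
    first, so label[-1] is the least significant bit a_0.
    """
    left = label[1:] + label[0]      # left cyclic shift
    right = label[-1] + label[:-1]   # right cyclic shift
    flipped = label[:-1] + ("1" if label[-1] == "0" else "0")
    # A label that is its own cyclic shift would give a self-loop; drop it.
    shuffle = sorted({left, right} - {label})
    return {"shuffle": shuffle, "exchange": [flipped]}
```

For example, node 011 has shuffle neighbors 101 and 110 and exchange neighbor 010.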

A  constant  congestion  embedding  requires  that  very 
few  edges  of  the  shuffle-exchange  be  mapped  to  long 
(more  than  constant  length)  paths  in  the  butterfly.  In 
addition,  these  paths  must  not  overlap  each  other  very 
often. To ensure this, we use the aforementioned theorem
of Benes concerning a butterfly graph's ability to
embed a permutation on its inputs.


That is, if the set of long paths can be decomposed
into a constant number of (partial) permutations of the
inputs of the butterfly, the long paths can be embedded
with constant congestion. It is easy to see that we can
embed the long paths in this manner when there are at
most a constant number of endpoints of long paths in
any single butterfly row. (We route a path from each
endpoint to the input of its row. This leaves us with
a constant number of "Benes routings" to perform.)

So  we  map  the  nodes  of  a  shuffle-exchange  graph  to 
the  nodes  of  a  butterfly  graph  so  that 

1.  at  most  a  constant  number  of  shuffle-exchange 
nodes  are  mapped  to  any  one  butterfly  node,  and 

2.  each  butterfly  row  contains  at  most  a  constant 
number  of  shuffle-exchange  nodes  which  have  any 
neighbor  mapped  to  a  distant  node  in  the  butterfly. 

Short paths only contribute constant congestion since
they have constant length. Long paths only contribute
constant congestion since we can route any permutation
with congestion 2, and we only need to route a
constant number of (partial) permutations. Also, the
length of the short paths is constant and the length of
the long paths is O(log n).

In particular, we map the nodes of an ($N = 2^n$)-node
shuffle-exchange graph to the nodes of a
$(n + 2 - \log n)\, 2^{n+2-\log n} \approx 4N$-node butterfly graph. Each node
in this N-node shuffle-exchange graph has n bits in its
label. A node in the butterfly can be specified by a
row represented by n + 2 - log n bits, and a level in the
row. The level in the row corresponds to a bit that can
be flipped to enter another row. Thus, we first associate
a shuffle-exchange node with a particular row of
the butterfly by removing log n - 1 adjacent bits of its
label, none of which is the least significant bit; then we
pick the level in the row which corresponds to where the
least significant bit of the shuffle-exchange node appears
in the row's representation.
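
The approximate size of the host butterfly can be checked directly from the quantities stated above. The short derivation below assumes logarithms are base 2, so that $2^{\log n} = n$.

```latex
(n + 2 - \log n)\, 2^{\,n + 2 - \log n}
  = (n + 2 - \log n)\cdot \frac{4 \cdot 2^{n}}{n}
  = 4N \left( 1 + \frac{2 - \log n}{n} \right)
  \approx 4N \quad \text{for large } n .
```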

We map a shuffle-exchange node w to a node in the
butterfly as follows.

1. Consider the longest string of zeros in w, ignoring
the least significant bit; break ties by choosing the
first one from the left.

2. Pick out log n - 1 bits as follows:

(a) If possible, choose the log n - 1 bits after the
zeros and before the lsb;

(b) otherwise, if possible, choose the log n - 1 bits
preceding the longest string of zeros;

(c) otherwise, choose the last log n - 1 bits of the
string of zeros (note that in this case more
than n - 2 log n bits are zeros).

3. Treat these bits as a number (it will be in the range
0 ... n/2); call this number s, and the sequence of bits $a_s$.

4. Remove the bits of s from w, extend the chosen
string of zeros on the right (left) by a 01 (10) if the
bits were removed from the right (left) of the block
of zeros, and cyclically shift the resulting string so that
s bits appear after the longest string of zeros. This
specifies the row.
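
Step 1 of the procedure above can be sketched in Python as follows. This covers only the search for the longest string of zeros; the function name is hypothetical, and the sketch assumes the run is not allowed to wrap around the ends of the label, which the text does not state explicitly.

```python
from typing import Tuple


def longest_zero_run(label: str) -> Tuple[int, int]:
    """Return (start, length) of the longest run of '0's in label,
    ignoring the least significant (last) bit.

    Ties are broken by taking the first run from the left.
    Assumption: runs do not wrap around cyclically.
    """
    body = label[:-1]  # drop the least significant bit
    best_start, best_len = 0, 0
    i = 0
    while i < len(body):
        if body[i] == "0":
            j = i
            while j < len(body) and body[j] == "0":
                j += 1
            # Strict '>' keeps the leftmost run on ties.
            if j - i > best_len:
                best_start, best_len = i, j - i
            i = j
        else:
            i += 1
    return best_start, best_len
```

For the label 1001000110 the function reports the run of three zeros starting at index 4 (counting from the left, starting at 0).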

Symbolically, we map $w = z\,0^k a_s\, y\, b$ to row $u\,0^{k+1} 1\, v$,
or we map $w = z\, a_s 0^k\, y\, b$ to row $u\, 1\, 0^{k+1} v$, with $ybz = vu$
and $|v| = s$. (Note that we map to a row with a unique
longest string of zeros not straddling the bit which is
at the level of the butterfly node.) It is easy to see
that the least significant bit of w, b, is somewhere in
the representation of the row. We choose the level in
the row to correspond to the position of b in the row's
representation.

We must argue that the mapping achieves conditions
1 and 2 above.

First,  we  introduce  some  more  notation.  We  define 
a  necklace  to  be  a  set  of  shuffle-exchange  nodes  which 
are  connected  only  by  shuffle  edges.  Alternatively,  a 
necklace  is  a  set  of  nodes  having  labels  which  are  cyclic 
shifts of each other. A necklace's label is the lexicographically
minimum label of its nodes. We can specify
a  shuffle-exchange  node  by  the  label  of  its  necklace  and 
the  position  of  the  least  significant  bit  of  the  node’s  label 
in  the  necklace’s  label. 
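
The necklace-based description of a node can be sketched in Python. The function names are hypothetical, labels are strings with the least significant bit last, and positions are counted from the left starting at 0; these conventions are assumptions made for illustration.

```python
from typing import Tuple


def rotl(s: str, t: int) -> str:
    """Cyclically shift string s left by t positions."""
    t %= len(s)
    return s[t:] + s[:t]


def to_necklace_form(label: str) -> Tuple[str, int]:
    """Return (necklace label, position of the node's lsb in that label)."""
    n = len(label)
    necklace = min(rotl(label, t) for t in range(n))
    # Smallest left shift that turns the node label into the necklace label.
    t = next(t for t in range(n) if rotl(label, t) == necklace)
    # The lsb sits at index n - 1 and moves t places to the left.
    return necklace, (n - 1 - t) % n


def from_necklace_form(necklace: str, pos: int) -> str:
    """Rebuild the node label whose lsb is at index pos of the necklace label."""
    return necklace[pos + 1:] + necklace[:pos + 1]
```

For instance, node 0110 belongs to the necklace labeled 0011, and its least significant bit is the bit at position 0 of that label.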

We  define  the  domain  of  a  butterfly  node  to  be  the 
set  of  shuffle-exchange  nodes  that  are  mapped  to  it  by 
our  mapping. 

Now we show that the mapping is at most two to one.
That is, given a butterfly node (p, r) we can describe
at most two shuffle-exchange nodes that could possibly
be mapped to (p, r) as follows. Recall that a butterfly
node (p, r) has all the bits of w in r's binary representation
except for $a_s$. These we recover by finding
the length of the string after the longest group of zeros
in r's binary representation not straddling the pth bit.
We know that we have to reinsert them either directly
before or directly after that group of zeros. This gives
us all the bits of the domain nodes except for a cyclic
shift uncertainty. Thus, the domain of (p, r) can only
be nodes from two necklaces. Furthermore, the least
significant bit of the nodes' labels is uniquely specified
by the place where the pth bit of r's binary representation
occurs in the necklaces' labels. Thus only two
shuffle-exchange nodes can be mapped to any node in
the butterfly.

Finally, we argue that we map at most a constant
number of shuffle-exchange nodes with distant neighbors
to any butterfly row.

Notice that we always ignore the value of the least
significant bit in the mapping of shuffle-exchange nodes
to butterfly nodes. Thus the mapping maps the two
shuffle-exchange nodes joined by an exchange edge to two
nodes that only differ in the bit that can currently be
changed by a butterfly edge. Thus, any exchange edge
needs only flip the bit at the node's level, which only
requires a path of length 2. Thus all exchange edges are
embedded in short paths.

Now consider the shuffle edges. We show that at most
a constant number of shuffle edges leave any row of the
butterfly. (It is easy to see that all the shuffle edges in a
row are mapped to single edges in the butterfly graph.)
Again, consider the inverse mapping of a butterfly node,
(p, r), to two shuffle-exchange nodes. The necklaces of
the domain nodes of row r's nodes are the same for
most of the row. They change only at certain transition
levels in the row: levels, p, in the row where the position
of the longest string of zeros not straddling p changes,
or levels in the row where we become unsure or sure of
which side of the zeros to replace the removed bits, $a_s$.

The position of the longest string of zeros not straddling
p only changes at two points, inside the row's
unique longest string of zeros. When the row level is
within log n bit positions to the right of the longest
string of zeros, we know that pieces of two shuffle-exchange
necklaces could have been mapped to the row.
Outside this range we know that only one necklace is
mapped to the row: inside the group of zeros the bits
were definitely taken out before the group of zeros, and
further to the right they were definitely taken out after
the group of zeros. Thus entering this stretch and leaving
this stretch gives us two more bad levels. Thus we
have four transition levels in all, and at each of these
at most four necklaces could enter or leave the row.
Thus at most 16 long shuffle edges
can have endpoints in this row. (Careful counting can
reduce this number to 6.)

Thus  at  most  16  long  edges  are  adjacent  to  any  row 
of  the  butterfly.  This  satisfies  condition  2,  above. 

Thus,  the  shuffle-exchange  graph  can  be  embedded  in 
the  butterfly  with  constant  congestion. 

5.4  Application  to  optimal  area  and  volume 
layouts  for  the  shuffle-exchange  graph 

The N-node butterfly can be laid out in $O(N^2/\log^2 N)$
area (trivially) and in $O(N^{3/2}/\log^{3/2} N)$ volume [20].
Since the N-node shuffle-exchange graph can be embedded
in the N-node butterfly with constant congestion,
we can simply blow up these layouts by a constant factor
to obtain layouts for the shuffle-exchange graph with
equivalent area and volume.

5.5  A  work  preserving  emulation  of  a  shuffle- 
exchange  graph 

We construct an O(log N)-step work-preserving simulation
of the shuffle-exchange graph on the butterfly
by first embedding the shuffle-exchange graph in
an N log N-node butterfly with constant congestion,
and then embedding the N log N-node butterfly in an
N-node butterfly in the natural way. It is not difficult
to show that the N-node butterfly can then simulate
the N log N-node shuffle-exchange in O(log N) steps.
Whether or not there is a real-time emulation remains
an interesting open question.

6  Emulations  in  a  shuffle-exchange 
graph 

6.1  Work  preserving  emulations  of  arbitrary 
binary  trees 

It  is  well  known  that  the  shuffle-exchange  graph  can 
emulate  a  complete  binary  tree  in  real  time.  Thus 
by  the  results  of  Section  4,  we  know  that  there  is 
an O(log log N)-time work-preserving emulation of the
class  of  binary  trees  on  the  shuffle-exchange  graph. 
Whether  or  not  this  emulation  can  be  made  real-time 
remains  an  open  question. 

6.2 A constant-dilation embedding of $N^{\epsilon}$ distinct
$N^{1-\epsilon}$-node butterflies

A shuffle-exchange graph of size N can hold $N^{\epsilon}$ distinct
$N^{1-\epsilon}$-node butterfly graphs for $0 < \epsilon < 1$ with max load
and congestion of $O(1/\epsilon)$.

We illustrate this by proving it for
$\epsilon = 1/2 - \log(\tfrac{1}{2}\log N)/\log N$.
That is, we embed M / log M distinct
M log M-node butterfly graphs in an ($N = M^2$)-node
shuffle-exchange graph with constant congestion and
constant dilation. We assume that $M = 2^k$. Thus
each row of the butterfly can be represented by a k-bit
string, and each node of the shuffle-exchange can
be represented by a 2k-bit string. A similar result was
proved by Raghunathan and Saran [16].

To map M / log M butterflies to the shuffle-exchange
graph, we use the following easily proven lemma.

Lemma 16 The set of (k = log M)-bit strings has at
least M/(2 log M) nonintersecting subsets of log M distinct
strings which are cyclic shifts of each other.
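
Lemma 16 can be checked by brute force for small k. The sketch below is not the proof referred to in the text; it only enumerates the k-bit strings, groups them by cyclic shift, and keeps the groups that contain exactly k distinct strings. The function name is hypothetical.

```python
def full_shift_classes(k: int) -> list:
    """Return the classes of k-bit strings under cyclic shift that
    contain exactly k distinct strings, each as a sorted list."""
    seen = set()
    classes = []
    for x in range(2 ** k):
        s = format(x, "0{}b".format(k))
        if s in seen:
            continue
        rotations = {s[i:] + s[:i] for i in range(k)}
        seen |= rotations
        # Periodic strings (e.g. 0101) have fewer than k distinct shifts.
        if len(rotations) == k:
            classes.append(sorted(rotations))
    return classes


def lemma_16_holds(k: int) -> bool:
    """Check that there are at least M / (2 log M) full classes, M = 2^k."""
    return len(full_shift_classes(k)) * 2 * k >= 2 ** k
```

For k = 4 (M = 16) there are three such classes, represented by 0001, 0011 and 0111, against the bound M/(2 log M) = 2.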

For each of these groups we pick the lexicographically
minimum string to represent the group. We associate
the M / log M butterflies two to one with the M/(2 log M)
groups' representative strings. Say butterfly i is associated
with string $w_i$. We map a node (p, r) in butterfly
i to a shuffle-exchange node by shuffling the bits of $w_i$
with the bits of r's representation, and choosing the
current bit to be under the image of $r_p$. That is, node
(p, r) in butterfly i is mapped to shuffle-exchange node
$r_1 w^i_1 \ldots r_p w^i_p \ldots r_n w^i_n$, where $w^i_j$ denotes bit j of $w_i$.

From a shuffle-exchange node we can recover the representative
string $w_i$ by picking out every other bit and
shifting to the lexicographically minimum string. We
find the row string by picking out the other bits and
shifting by the same amount. The position in the row
is clearly the number of shifts we used to get to $w_i$ and
the row number.
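
The mapping and its inverse can be sketched in Python under explicit assumptions that are not in the original: levels p are 0-indexed, r and the representative string w are k-bit strings, w is the lexicographically minimum string of a group with k distinct cyclic shifts, and the "current bit" is taken to be the last bit of the 2k-bit label. The two-to-one association of butterflies with representative strings is not modeled. All function names are hypothetical.

```python
def rotl(s: str, t: int) -> str:
    """Cyclically shift string s left by t positions."""
    t %= len(s)
    return s[t:] + s[:t]


def to_shuffle_exchange(p: int, r: str, w: str) -> str:
    """Image of butterfly node (p, r) in the butterfly tagged by w."""
    interleaved = "".join(rb + wb for rb, wb in zip(r, w))
    # r[p] sits at index 2p; shift so that it becomes the last bit.
    return rotl(interleaved, 2 * p + 1)


def from_shuffle_exchange(label: str):
    """Recover (p, r, w) from a 2k-bit label produced above."""
    k = len(label) // 2
    w_shifted, r_shifted = label[0::2], label[1::2]
    # Shift the tag bits to their lexicographically minimum rotation.
    t = min(range(k), key=lambda j: rotl(w_shifted, j))
    w = rotl(w_shifted, t)
    p = (k - t) % k
    # The row bits were shifted one place further than the tag bits.
    r = rotl(r_shifted, t - 1)
    return p, r, w
```

With k = 4, w = 0011, r = 1010 and p = 0, the interleaved string is 10001101 and the image is 00011011, whose last bit is r[0].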

To finish, we observe that each edge in any of the
butterflies is mapped to a path of length at most three
in the shuffle-exchange graph, since we either shift twice
to reach (p + 1, r)'s image, or we exchange the current
bit and shift twice to reach $(p + 1, r_1 \ldots \bar{r}_p \ldots r_n)$'s image.

Thus we can embed $\sqrt{N}/\log\sqrt{N}$ butterflies, each with
$\sqrt{N}\log\sqrt{N}$ nodes, in an N-node shuffle-exchange with max load
2 and dilation 3.
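
The dilation bound can be checked exhaustively for small cases. The sketch below assumes the same conventions as one natural reading of the mapping (0-indexed levels, the current bit stored last, levels wrapping around from k - 1 to 0, and a cross edge at level p flipping bit p of the row); none of these conventions is spelled out in the original, and the function names are hypothetical.

```python
from itertools import product


def rotl(s: str, t: int) -> str:
    """Cyclically shift string s left by t positions."""
    t %= len(s)
    return s[t:] + s[:t]


def flip(bit: str) -> str:
    return "1" if bit == "0" else "0"


def image(p: int, r: str, w: str) -> str:
    """Interleave r with w and shift so that r[p] is the last bit."""
    interleaved = "".join(rb + wb for rb, wb in zip(r, w))
    return rotl(interleaved, 2 * p + 1)


def edges_have_short_paths(k: int, w: str) -> bool:
    """Check every butterfly edge maps to a path of length at most 3."""
    for p in range(k):
        q = (p + 1) % k  # assumption: levels wrap around
        for bits in product("01", repeat=k):
            r = "".join(bits)
            src = image(p, r, w)
            # Straight edge: two shuffle (left shift) steps.
            if rotl(src, 2) != image(q, r, w):
                return False
            # Cross edge: one exchange step, then two shuffle steps.
            r_cross = r[:p] + flip(r[p]) + r[p + 1:]
            exchanged = src[:-1] + flip(src[-1])
            if rotl(exchanged, 2) != image(q, r_cross, w):
                return False
    return True
```

Calling `edges_have_short_paths(4, "0011")` walks all 64 nodes of a 16-row butterfly and confirms that straight edges use two shuffle steps and cross edges use one exchange step followed by two shuffle steps.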

This technique can be extended to prove that for any
constant $0 < \epsilon < 1$, $N^{\epsilon}$ distinct $N^{1-\epsilon}$-node butterfly graphs
can be embedded in an N-node shuffle-exchange.

6.3  Application  to  sorting  on  a  shuffle- 
exchange  graph 

It is known that an N-node butterfly can sort N packets
with high probability in O(log N) steps [11, 15, 17]. The
result does not directly extend to the shuffle-exchange
graph because the shuffle-exchange graph does not have
the nice recursive structure possessed by the butterfly.
However, by combining the embedding result of Section
6.2, the butterfly sorting algorithm in [11], and the
columnsort algorithm of [9], we can obtain an algorithm
for sorting N packets on an N-node shuffle-exchange in
O(log N) steps with high probability.

6.4  Real  time  emulations  of  arrays 

By combining a single level of the kind of analysis in
Section 5.2 with the result of Section 6.2, we can emulate
an array in real time on a shuffle-exchange graph. This
is despite the fact that any O(1)-to-1 embedding of an
N-node array (with dimension 2 or more) in a
shuffle-exchange graph has dilation $\Omega(\log \log N)$ [2].

6.5 A work preserving emulation of the butterfly

By using standard techniques in routing normal
hypercube algorithms, it is easily shown that there is an
O(log N)-step work-preserving simulation of a butterfly
on a shuffle-exchange graph. Whether or not there is a
real-time simulation remains an important open question.

7  Remarks  and  open  questions 

There  are  many  questions  left  open  by  this  paper.  We 
list  a  few  of  them  in  what  follows. 

1. Is there a real-time simulation of a complete ternary
tree  on  a  complete  binary  tree? 

2.  Is  there  a  (universal)  class  of  bounded-degree 
graphs that can simulate the class of all bounded-degree
graphs? (If so, they must be expanders.)




3.  Is  there  a  real-time  simulation  of  a  butterfly  on  a 
shuffle-exchange  graph  or  vice-versa? 

4. Can the notion of work-preserving be meaningfully
modified to incorporate measures such as VLSI layout
area?

5. Are meaningful results possible if we consider simulations
that are not work-preserving, but which are
close to work-preserving (e.g., we allow inefficiency
of $\Theta(\log N)$)?